


MIT/LCS/TR-362

Simulating Applicative Architectures on the Connection Machine 

by 

Bradley Clair Kuszmaul 

S.B. Computer Science and Engineering, M.I.T. (1984) 

S.B. Mathematics, M.I.T. (1984) 

Submitted to the 

Department of Electrical Engineering and Computer Science 

in partial fulfillment of the 

requirements of the degree of 

Master of Science 

in Electrical Engineering and Computer Science 

at the Massachusetts Institute of Technology 

June, 1986 
©Massachusetts Institute of Technology 1986 



Signature of Author 



Department of Electrical Engineering and Computer Science 

May 9, 1986 



Certified by 



Jack B. Dennis 
Thesis Supervisor 



Accepted by 



Arthur C. Smith 
Chairman, Committee on Graduate Students 



Simulating Applicative Architectures on the Connection Machine 

by 

Bradley Clair Kuszmaul

S.B. Computer Science and Engineering, M.I.T. (1984) 

S.B. Mathematics, M.I.T. (1984) 

Submitted to the Department of Electrical Engineering and Computer Science on May 9, 

1986 in partial fulfillment of the requirements of the degree of Master of Science in Electrical 

Engineering and Computer Science at the Massachusetts Institute of Technology 

Abstract: The connection machine (CM) is a highly parallel single instruction multiple 
data (SIMD) computer, which has been described as 'a huge piece of hardware looking for a 
programming methodology' [Arv]. Applicative languages, on the other hand, can be described 
as a programming methodology looking for a parallel computing engine. By simulating archi- 
tectures that support applicative languages ('applicative architectures') (e.g., data flow and 
reduction architectures) on the CM we can achieve the following goals: 

• Quickly and easily experiment with the design and implementation of applicative archi- 
tectures. 

• Run large applicative programs efficiently enough to gain useful experience. 

• Support programming environments that allow us to do general purpose computation 
on the CM. 

We describe the techniques which we use to simulate applicative architectures on the CM, and 
then discuss the implications for the general case of simulating multiple instruction multiple 
data (MIMD) systems on single instruction multiple data (SIMD) computers. We describe the 
results of our simulations, concluding that the CM can run applicative programs efficiently, 
even though the CM was not explicitly designed for that task. 
Thesis Supervisor: Jack B. Dennis 

Title: Professor of Computer Science and Engineering. 

Keywords: Applicative Architectures, Connection Machine, Combinators, Data Flow, Simulation. 







Contents

1 Introduction
  1.1 Background
  1.2 Results
  1.3 Floor plan

2 The Connection Machine
  2.1 Connection Machine Hardware
  2.2 Connection Machine Software
    2.2.1 Context Manipulation Macro Instructions
    2.2.2 Arithmetic Macro Instructions
    2.2.3 Global Or Wire Operations
    2.2.4 Message Sending Macro Instructions
    2.2.5 Consing Macro Instructions
    2.2.6 Virtual Processors

3 Applicative Architectures
  3.1 Static Data Flow
    3.1.1 Static Data Flow Program Graphs
    3.1.2 Static Data Flow Structure Storage
  3.2 VIM Style Dynamic Data Flow
    3.2.1 …
    3.2.2 …
    3.2.3 …
  3.3 Combinator Reduction

4 STARTALK

5 Conclusion
  5.1 Performance of the Simulators
  5.2 …
  5.3 Future Work



List of Figures 



2.1 The global or wire is implemented as a global or tree
2.2 A naive implementation of cm:get which works if not too many processors are fetching from any given processor
2.3 Fan-in trees are a high level convention which can handle the problems associated with using cm:send-data
2.4 Sorting and scanning to make cm:send-with-add run quickly
2.5 Copying graphs in constant time
2.6 A program to implement cm:cons in terms of cm:enumerate
2.7 Enumeration by subcube induction
3.1 When a third pointer is needed to a cons cell, we copy the cons cell
3.2 When copying a cons cell, the data needs to have the copy operation performed also. In this case, the whole tree needs to be copied because all the cons cells are 'saturated'
3.3 If all the leaves of the tree are equal, it can take Ω(m·2^n) PE's to represent m copies of a structure with only n distinct cons cells
3.4 In the best case it takes Θ(n + m) PE's to represent a graph with n distinct cons cells and m pointers into the graph
3.5 Translating lexically scoped functions into closures




Acknowledgements 

I would like to thank my thesis advisor, Jack Dennis, for his guidance and encouragement, 
Mike Dertouzos for finding money for me when no one else could, and Guy Steele for helping 
to deal with the hair of doing off campus research. 

Thinking Machines Corporation provided a connection machine and software support. 
Too many TMC people are involved to name them all, but they include Danny Hillis, Cliff 
Lasser, Brewster Kahle, John Rose and Guy Steele. 



The phrase "connection machine" is a registered trademark of Thinking Machines Corporation. 



Chapter 1 



Introduction 



The connection machine (CM) is a highly parallel single instruction multiple data (SIMD) 
computer[Hil85], which has been described as 'a huge piece of hardware looking for a program- 
ming methodology' [Arv], while applicative architectures can be described as a programming 
methodology for parallel computation looking for a parallel computing engine. 

We have simulated architectures which support applicative languages ('applicative archi- 
tectures') on the CM in order to try to understand how to design scalable architectures to 
perform efficient general purpose parallel computing. 

We define a general purpose parallel computer to be a computer that can exploit parallelism, 
when present, in any algorithm[AI85]. A scalable architecture is an architecture that allows 
us to add hardware resources, resulting in higher performance, without requiring substantial 
rewriting of application programs[AI85]. By efficiency, we mean that we would like the 
performance of the architecture to improve linearly (if possible) with the amount of hardware 
resources and the amount of parallelism apparent in the algorithm. 

Applicative architectures, such as data flow and combinator reduction architectures, are 
efficient, general purpose, scalable architectures, and much work has been done to demon- 
strate this[AI85]. We are interested in specific implementation issues, and have built efficient 



parallel simulators for several applicative architectures on the CM. In this context, efficient 
means that as we increase the size of the connection machine, we should be able to get a 
corresponding linear improvement in the performance of the simulators (once again assuming 
sufficient parallelism in the algorithms being run on the simulated architectures). 

This simulation allows us to quickly and easily experiment with new ideas for applicative 
architectures, dramatically reducing the expense of such experimentation, in much the same 
way as MIT's proposed Multiple Processor Emulation Facility[ADI83] will when completed. 

We have designed simulators for three applicative architectures: 

• Static data flow[Den74], 

• VIM style dynamic data flow[Den80], and 

• Combinator reduction[Tur79]. 

We have actually implemented the static data flow and combinator reduction simulators. In 
this paper we describe the design of those simulators for running on the CM, and some lessons 
we have learned from the design and implementation of these simulators. 

1.1 Background 

The connection machine is an SIMD machine, in which every processor has some local state 
and a connection to a router network. The connection machine processing elements are bit 
serial, which means that it requires quite a few clock cycles to perform a complicated operation 
such as floating point multiply. There is a front end computer (sometimes called a host) which 
interacts with the connection machine. In order to reduce the required bandwidth between the 
host and the CM processors, there is a microcontroller, which receives high level instructions 
(such as 'add two numbers') from the host, and issues many low level bit operations to the 
connection machine processors. 

The applicative architectures that we are considering are all multiple instruction multiple 
data (MIMD) architectures; each processor may need to do something different with its data. 



In order to simulate such architectures, we need to be able to simulate an MIMD machine on 
the CM. We will exploit the fact that there are not really very many different kinds of things 
that can happen in an applicative architecture (e.g. in a static data flow architecture, there 
are only a few different kinds of graph nodes, and in a combinator reduction architecture, 
there are only a few different combinators) . Our general strategy is to cycle through each 
of the different things that could be done by a processor. Given a thing to be done, there 
is some state transition of the processors which want to do that thing. We select those 
processors which 'want' to do that thing, and issue the SIMD instructions to perform that 
state transition. 

This paper assumes that the reader is familiar with the applicative architectures that we 
are simulating. 

1.2 Results 

We have learned a few important lessons for running complex programs on the connection 
machine: Writing programs directly in low level connection machine languages is very difficult, 
and it helps to have structured programming languages to keep this complexity under control. 
It is also important to carefully choose low level primitives; in particular, the choice of message 
passing primitives is very important. 

We can run relatively large applicative programs efficiently enough to gain useful experi- 
ence in the design of programming constructs and program development tools for applicative 
programs. 

In some sense, our simulators are an existence proof of the statement that the connection 
machine can do general purpose parallel computation: We can run functional languages on 
the connection machine. 






1.3 Floor plan 

Chapter 2 provides an abstract description of the connection machine by describing some 
'high level instructions' for the connection machine. These high level instructions are called 
macro instructions. Since the connection machine microcontroller can be programmed, the 
set of macro instructions can be chosen to suit the needs of the programming community. 
We will describe how some of the more complex instructions that we have found useful are 
implemented. 

Chapter 3 describes the applicative architectures that we are simulating, along with the 
design of our simulator for each of those architectures. We note that the choice of macro 
instructions can dramatically affect the design and complexity of the simulators. 

In order to control the complexity of writing MIMD programs for a SIMD machine, we 
designed a language called STARTALK which allows us to simulate MIMD architectures on 
a connection machine. Chapter 4 discusses the motivation for designing a new program- 
ming language for the connection machine, and describes the STARTALK language and our 
STARTALK compiler, including the optimizations which are performed. The simulators were 
written in STARTALK. 

Chapter 5 concludes with some preliminary performance measurements and a discussion 
of our conclusions and some remaining open problems. 






Chapter 2 



The Connection Machine 



The connection machine (CM) is a highly parallel single instruction multiple data (SIMD) 
computer[Hil85]. This section provides an abstract description of the connection machine, 
along with enough of a description of the hardware to make it possible to understand how to 
code the algorithms that we will describe later in this paper. 

2.1 Connection Machine Hardware 

The prototype connection machine under construction at Thinking Machines Corporation, 
Cambridge, MA, consists of the following: 

• There are 64K processing elements (PE's). 

• There are 4K bits of local memory associated with each PE. 

• There are a few one bit flags in each PE. 

• Each PE is a one bit ALU. 

• Each PE is connected to a router which delivers messages between PE's through a 
network with a hypercube topology. 




• There is also a two dimensional communication network, which we will ignore in the 
rest of this paper. 

• There is a microcontroller which receives high level commands (macro instructions) such 
as 'add two fields' from the front end computer and issues bit operations to the PE's. 

• There is a front end computer which serves as the user interface to the connection 
machine. The front end computer issues macro instructions and provides a program 
development environment for the user. 

2.2 Connection Machine Software 

One of the fundamental concepts for programming the CM is the 'currently selected set' 
(CSS) of PE's. Most macro instructions are executed only by the PE's in the CSS, so that, 
e.g. during an ADD macro instruction, only those PE's in the CSS will actually perform the 
ADD, while the other PE's do nothing. We will describe some macro instructions, with the 
caveat that these macro instructions are not the macro instruction set implemented for the 
actual CM microcontroller: The exact macro instruction set supported on the CM differs in 
being richer than the instruction set we describe here, and in certain other ways, such as the 
choice of names and arguments of operations. A data item always consists of a string of bits 
having consecutive addresses within the memory of a PE. Such a bit string is called a field, 
and can be characterized by its starting address in the 4K local address space, and a length. 
A pointer is a field containing enough bits to specify the absolute address of a PE (i.e. the 
PE number). The CSS is stored in one of the flags mentioned above (e.g. the context-flag in 
PE number i contains a one if and only if PE number i is in the CSS). 

2.2.1 Context Manipulation Macro Instructions 

There are three context manipulation functions which, together, can perform all context 
manipulation operations that we might desire. The context manipulation operations are very 






fast, and each runs in only a few clock cycles. 

cm: set-context 

Sets the CSS to be all the PE's: 

CSS ← the set of all PE's. 

cm: load-context m 

Takes m as a memory address. Let M(m) be the set of PE's for which memory location m 
contains a one. (Note that, as always, "memory locations" are addresses in the processors' 
local memories.) 

CSS ← CSS ∩ M(m). 

I.e. the context is conditionally loaded from memory location m. 

cm: store-context address 

In all processors currently selected, a one is stored at location address, and at all the processors 
not currently selected, a zero is stored at location address. 

It is possible to perform any context manipulation with these primitives, but for perfor- 
mance reasons one might be interested in defining other context manipulation functions. 
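
As an illustration of how such a derived context operation can be layered on top of these primitives, the following sketch selects exactly the PE's that are not in the current CSS. It is written only in terms of the primitives above plus cm:lognot (described in the next section) and a caller-supplied one-bit scratch location; the name cm:complement-context is ours and is not part of the actual instruction set.

(defun cm:complement-context (scratch)
  (cm:store-context scratch)     ; scratch <- 1 in selected PE's, 0 elsewhere
  (cm:set-context)               ; select every PE so the lognot runs everywhere
  (cm:lognot scratch scratch 1)  ; scratch <- NOT scratch, in every PE
  (cm:load-context scratch))     ; CSS <- the PE's whose scratch bit is now 1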

2.2.2 Arithmetic Macro Instructions 

We will describe one of the 'arithmetic' macro instructions in detail, and hint at the richness 
of the instruction set that is actually available. These instructions involve no inter-PE com- 
munication, or PE to front-end-computer communication. They change the local state of the 
PE's (i.e. the flags and the local memory). 

cm:lognot destination source length 

The destination, source, and length values are integers. The destination and source are ad- 
dresses in the 4K local memory of the PE's, and the length is the number of bits on which to 






operate (i.e. the field length). Each bit of the destination field is set to the logical not of 
the corresponding bit of the source field. This operation takes a few clock cycles for each bit 
of the operation. 

Similarly, there are operations to perform the bitwise logical and operations, and other 
logical operations. Since the connection machine is microcoded, it is possible to define these 
operations as two address or three address instructions. Near the other end of the complexity 
spectrum we find floating point operations, such as three address floating point multiply, which 
might take three addresses as input and perform floating point multiplication conforming 
to the IEEE floating point standard. The more complex operations can take longer: We might 
expect a floating point operation to take a thousand clock cycles. 
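
As a sketch of how these operations compose, a logical inclusive or of two fields can be derived from cm:lognot and the two-address cm:logand used in Figure 2.2 (which we assume computes dest ← dest AND source) by De Morgan's law. The name cm:logior-fields and the calling convention are our assumptions; the real instruction set provides logical or directly.

(defun cm:logior-fields (dest source length)
  (cm:lognot dest dest length)       ; dest   <- NOT dest
  (cm:lognot source source length)   ; source <- NOT source
  (cm:logand dest source length)     ; dest   <- (NOT dest) AND (NOT source)
  (cm:lognot dest dest length)       ; dest   <- original dest OR source (De Morgan)
  (cm:lognot source source length))  ; restore source to its original value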

2.2.3 Global Or Wire Operations 

There is a global or wire which allows the host machine to find out if any processor satisfies 
a given predicate. The global or wire is actually implemented as a tree of or gates (see 
Figure 2.1). The macro instruction that uses this wire is as follows: 

cm:global-or location 

Returns a zero if and only if all the selected processors have a zero in location location, 
otherwise returns a one. This is a very fast instruction, and takes only a few clock cycles. 
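
The usual idiom on the front end is to poll this wire to decide whether any selected PE still has work to do, as in the termination test of the while loop in Figure 2.2. A trivial helper might look as follows; the name and the one-bit work-pending field are ours, for illustration only.

(defun cm:any-work-pending-p (work-pending)
  ;; true on the front end iff some selected PE has a 1 in WORK-PENDING
  (not (zerop (cm:global-or work-pending))))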

2.2.4 Message Sending Macro Instructions 

There are several ways of sending data from one processor to another. We will ignore the two 
dimensional grid in this paper, and concentrate on the ways to communicate over the general 
purpose routing network. 

cm:send-data dest-memory dest-proc source-memory size notify-bit 

Sends data from PE's in the CSS to other PE's. The PE that is to receive a message is 




Figure 2.1: The global or wire is implemented as a global or tree. 

specified with its absolute address starting in location dest-proc, which contains a pointer. 
Note that this allows the destination address to be computed locally by each PE. The data 
being sent starts at location source-memory and is of length size. The data is deposited at 
the field of length size starting at location dest-memory in those PE's that receive messages 
(the dest-memory field is unchanged in other PE's). Memory location notify-bit is set to 1 in 
those PE's that receive messages, and to 0 in those that do not, independently of whether they 
are in the CSS. If more than one message is received at any given PE, that PE receives a 
message, but the data deposited in dest-memory is unspecified (it may be one or the other 
of the messages, or it may be totally garbled). This specification allows that if m messages 
arrive at a given processor, the time for completion of the cm:send-data instruction may be 
at least linear in m (i.e. the time can be Ω(m), due to serialization of the arriving messages) 
without violating this specification. 

The simple collision handling mechanism defined by cm: send-data places important con- 
straints on the design of applications that use routing. 

• If two processors need to send data to a third processor, and the data must not be 






garbled, the software must arrange at a higher level that the two processors do not send 
the data in the same cm:send-data instruction. 

• Even if the data can be garbled (an example of this would be where the information 
being sent is a boolean, and the fact that a message has arrived at all can be construed 
as a true), the cm: send-data instruction still does not do the job if a large number of 
processors need to send the value to a single destination (it is too slow). The problem 
here is due to the serialization done by cm: send-data at the destination. 

• There is no way for a processor to efficiently retrieve data from another processor (without 
resorting to high level conventions). In particular, if processor A contains data, and 
processors B_1, B_2, ..., B_m need to access that data, it may take linear (in m) time to 
get the data distributed to the B's. If, on the other hand, m is always small, this sort of 
fetch operation can be done by having each processor send its own address to the 
processor being fetched from, and that processor will send its data to the processor whose 
address it has just received. An implementation for such a fetch operation is shown in 
Figure 2.2. 

The original solution that we proposed to these problems was to use a higher level software 
convention to work around the problems in cm: send-data. One design involved copying data 
whenever too many pointers to the processor containing the data are needed, and another 
design uses fan-in trees. 

Fan-in trees work as follows (see Figure 2.3): If m processors need to communicate with 
one, then a tree of processors of depth Ω(log m) is built, and all communication is done 
through the tree. In particular, if processors B_1, ..., B_m need to send a value to processor 
A, then the B's send it into the tree, and the internal tree nodes, when they receive one or 
two messages, forward one message toward the root. In this way, processor A is guaranteed 
not to receive too many messages, and the messages will not be garbled. The tree can also be 
used to retrieve data from processor A, by sending a request for the data towards the root of 
the tree, and letting the data from A be sent back towards the leaves of the tree. 




(defun cm:get (from-proc dest source length)
  ;; The WITH-SCRATCH-SPACE form allocates temporary memory in
  ;; all the processors.  E.g. the lisp variable TEMP-CONTEXT is
  ;; bound to an integer which is the address of one bit of scratch
  ;; memory, and GETER-ADDRESS is bound to an integer which is the
  ;; address of a field which can contain a pointer to another
  ;; processor.
  (with-scratch-space ((temp-context 1)
                       (geter-address *address-size*)
                       (remaining-context 1)
                       (received-p 1))
    (cm:store-context temp-context)
    (cm:store-context remaining-context)
    (while (not (zerop (cm:global-or remaining-context)))
      ;; Each processor sends its own address.  Note that every
      ;; processor stores its own address in a location specified
      ;; by *SELF-ADDRESS*.
      (cm:send-data geter-address from-proc
                    *self-address* *address-size* received-p)
      ;; now the guys who received a message should reply
      (cm:set-context)
      (cm:load-context received-p)
      (cm:send-data dest geter-address source length received-p)
      ;; now the guys who received the reply drop out of the
      ;; computation
      (cm:set-context)
      ;; set remaining-context to zero if received-p
      (cm:lognot received-p received-p 1)
      (cm:logand remaining-context received-p 1)
      ;; and load the context to try again
      (cm:load-context remaining-context))
    ;; now restore the original context
    (cm:set-context)
    (cm:load-context temp-context)))

Figure 2.2: A naive implementation of cm:get which works if not too many processors are fetching from any given processor.






Data can be distributed from right to left.
Data can be added or OR'd as it moves from left to right.

Figure 2.3: Fan-in trees are a high level convention which can handle the problems associated with using cm:send-data.

Also, any associative and commutative operation can be used to combine messages when such a tree has been 
built (e.g. the contents of the messages can be added together, or they can be bitwise or'd). 
We actually implemented such fan-in trees for our static data flow simulator, and they 
were responsible for most of the complexity of the interpreter, as we shall show in Section 3.1, 
where we discuss the implications of using fan-in trees or other high level software conventions 
to avoid the problems inherent in cm:send-data. The complexity of implementing these fan-in 
trees is almost overwhelming in the simulation of combinator graph reduction and VIM 
style dynamic data flow. Thus, we define several operations to address these issues. We will 
give the specifications of the operations, along with a sketch of their implementations on the 
connection machine. 

cm: send-with-add dest-memory dest-proc source-memory size notify-bit 

Sends data as for cm:send-data, except that if any two messages are being sent to the same 
location, their data is added together. In order to achieve good performance, this addition 
must, in general, occur before the messages actually arrive at their destination (e.g. the 






addition could happen as soon as two messages collide in the network). 

cm:send-with-logior dest-memory dest-proc source-memory size notify-bit 

Sends data similarly to cm: send- with- add, except that when messages collide, their data bits 
are bitwise inclusive or'd together. 

We also define an instruction for fetching data from other processors: 

cm: get from-proc dest source length 

Each selected processor ends up with its memory field at location dest of length length con- 
taining the data that was in the field at location source of length length in the processor 
named in the pointer field at from-proc. 

Figure 2.2 contains an implementation of cm: get which is correct except that it is too 
slow. 

All of these more general operations must be implemented by using fan-in trees at some 
level, either in hardware or in software. 

The NYU Ultracomputer[DGK86] implements these fan in trees in hardware: The trees 
are built on the fly by the routing network. The connection machine could use the same 
tricks as the Ultracomputer, except that it would have to simulate the tricks in software (and 
they would be implemented in microcode). We use another technique (described below) to 
implement these abstract routing operations. 

Since the prototype connection machine does not support general routing operations such 
as cm: send-with-add and cm: get directly in hardware, a software simulation is done. The al- 
gorithm used is to sort the messages, using a standard parallel sorting algorithm[BRR,Kus85], 
according to the destination address (See Figure 2.4). After sorting the messages, we have the 
property that messages going to any given PE are in adjacent PE's. The PE's can then, in one 
step, form themselves into linked lists (each list containing messages going to one particular 
PE). Given the linked lists, the PE's can form themselves into balanced binary trees by first 






passing their own addresses one element to the left (forming links which span two PE's), and 
then two elements to the left (forming links which span four PE's) and so on. The combining 
can then be done in log n phases, where n is the maximum number of messages going to any 
processor. 

It is also possible to use the sorting and scanning trick mentioned for cm:send-with-add 
and shown in Figure 2.4 to do the cm: get operation quickly. 
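
To make the combining step concrete, the following is a minimal sequential model, in plain Common Lisp running on the front end rather than on the CM, of what the sort-and-combine microcode accomplishes; the function name and the representation of messages as (destination . value) pairs are ours, not part of the CM instruction set.

(defun combine-messages (messages &key (op #'+))
  "MESSAGES is a list of (destination . value) pairs.  Sort them by
destination and combine the values of messages sharing a destination
with OP, as cm:send-with-add does with addition."
  (let ((sorted (sort (copy-list messages) #'< :key #'car))
        (result '()))
    (dolist (msg sorted (nreverse result))
      (if (and result (= (car msg) (caar result)))
          ;; same destination as the previous message: combine the data
          (setf (cdar result) (funcall op (cdar result) (cdr msg)))
          ;; a new destination: start a new run
          (push (cons (car msg) (cdr msg)) result)))))

;; (combine-messages '((7 . 1) (2 . 5) (7 . 1) (7 . 1)))  =>  ((2 . 5) (7 . 3))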

We can generalize these operations a little further, and design an operation that builds 
processors into a tree. 

Definition 1 A class of binary trees is roughly balanced if the maximum depth of any tree 
is a polynomial in the log of the number of vertices in that tree. 

cm: build-tree key key-length root-p left left-p right right-p 

The cm: build-tree operation builds trees out of processors. All the processors with the same 
key are formed into a tree. The key is stored in a field of length key-length at memory address 
key. Thus if there are n distinct keys actually appearing in various selected processors, we will 
end up with n trees. The trees are binary trees, which are required to be roughly balanced. 
(I.e. there is a polylog function which says how unbalanced the tree can be.) Each processor 
has the single bit at memory location root-p set to one if it is the root of the tree in which it 
appears, otherwise it has the one bit field at root-p set to zero. Each processor that has a left 
child, has the one bit field at left-p set to one and the pointer field at left set to the address 
of that left child. If there is no left child, then left-p is set to zero and the data in the pointer 
field at left is undefined. Similarly, right-p is set to one iff the processor has a right child, and 
if so, right contains the address of that child. 
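
A hedged sketch of one use of cm:build-tree: combining a value field with MAX within every group of PE's that share the same key. We assume a two-address cm:max macro instruction (dest ← max(dest, source)); the real instruction set is richer than the one described here, but that particular name is our invention. Because max is idempotent it is harmless to re-combine a child's partially combined value, so after a number of rounds equal to the (polylog) bound on the tree depth, every root holds the maximum for its key.

(defun combine-max-by-key (key key-length value value-length depth-bound)
  (with-scratch-space ((temp-context 1)
                       (root-p 1)
                       (left *address-size*)  (left-p 1)
                       (right *address-size*) (right-p 1)
                       (child-value value-length))
    (cm:store-context temp-context)
    (cm:build-tree key key-length root-p left left-p right right-p)
    (dotimes (round depth-bound)
      ;; fold in the left children
      (cm:set-context)
      (cm:load-context left-p)
      (cm:get left child-value value value-length)
      (cm:max value child-value value-length)
      ;; fold in the right children
      (cm:set-context)
      (cm:load-context right-p)
      (cm:get right child-value value value-length)
      (cm:max value child-value value-length))
    ;; restore the caller's context
    (cm:set-context)
    (cm:load-context temp-context)))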

Our early designs explicitly maintained the fan in trees needed to implement these abstract 
operations. Our final design uses microcoded versions of these operations to build fan in trees 
on the fly. One can argue that by repeatedly building the fan in trees over and over, we 








PE's 0, 3, 4, 5, and 8 actually communicate with the destination, and then they form trees out of the PE's which want to communicate to the same destination. Then the intermediaries send the data back to the source.

Figure 2.4: Sorting and scanning to make cm:send-with-add run quickly.






have wasted computing resources, but we believe that the large constant factor gained by 
microcoding these routines offsets the expense of rebuilding the fan in trees every time they 
are needed. 

2.2.5 Consing Macro Instructions 

By 'consing' we mean the operation of allocating resources. In our case, we are interested in 
allocating PE's. It is easy to allocate PE's one at a time under the control of the front end 
computer, but we are interested in allocating PE's in parallel. In particular, we would like 
the following operation. 

cm: cons new-address want free 

Here, new-address is the address of a pointer, want is a memory address which specifies which 
PE's want to cons (i.e. those with memory location want set to one), and free is a memory 
address which specifies the set of PE's that are free. This operation is done regardless of the 
CSS. Let W be the set of PE's for which want is one, and F be the set of PE's for which 
free is one (it is not necessary that W and F be disjoint, although it is expected that in most 
applications F and W will be disjoint). If |F| < |W| then the result is unspecified. Otherwise, 
each element of W is given the address of a unique PE in F, and in each PE from F that is 
allocated, free is set to zero. 

This cm: cons operator gives us much power. For example, we can copy a graph in constant 
time (actually, in time which is proportional to the maximum degree of the graph), once the 
graph has been identified (see Figure 2.5). To do this we cause each element of the graph to 
cons up a free PE, then forward the address of that free PE to its neighbors in the graph, 
and then, having received addresses from its neighbors, send those addresses to the new 
PE[Chr84]. 
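
The following is a hedged sketch of that technique for the special case of a graph of cons-cell PE's (degree two), phrased with cm:get rather than explicit forwarding of addresses. The field names (car-field, cdr-field, want, free, new-cell, ...) and the assumption that everything reachable through car-field and cdr-field is itself a cell of the selected graph are ours, for illustration only.

(defun copy-selected-graph (car-field cdr-field want free)
  (with-scratch-space ((new-cell *address-size*)  ; the address of my copy
                       (new-car  *address-size*)  ; the copy of my CAR target
                       (new-cdr  *address-size*)  ; the copy of my CDR target
                       (ignore-bit 1))
    ;; the cells of the graph are marked by a 1 in WANT; select exactly them
    (cm:set-context)
    (cm:load-context want)
    ;; 1. every cell of the graph allocates one free PE to hold its copy
    (cm:cons new-cell want free)
    ;; 2. every cell looks up the copies allocated by its CAR and CDR targets
    (cm:get car-field new-car new-cell *address-size*)
    (cm:get cdr-field new-cdr new-cell *address-size*)
    ;; 3. every cell fills in its own copy; since each cell owns a distinct
    ;;    copy, no two messages collide in these sends
    (cm:send-data car-field new-cell new-car *address-size* ignore-bit)
    (cm:send-data cdr-field new-cell new-cdr *address-size* ignore-bit)))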

The cm: cons operation is of sufficient complexity that we believe that an explanation of its 
implementation is in order. The cm: cons operation is implemented in terms of an 'enumerate' 







Figure 2.5: Copying graphs in constant time.






operation: 

cm: enumerate number 

The number argument is the address of a pointer. Let n be the size of the CSS. For each PE 
in CSS the pointer starting at number is set to a unique integer in the range 0 (inclusive) to n 
(exclusive). The value n is returned as a result of the call to cm: enumerate in the front end 
computer. 

If we are given cm: enumerate, we can implement cm: cons by performing an enumeration 
of the free set, and an enumeration of the set that wants to cons. Let n be the minimum of 
the size of the set that wants to cons and the size of the free set. Each PE sends its own 
address to the value it received during the enumeration. For each 0 ≤ t < n there is one PE 
f in the free set and one PE w in the consing set which received t during their respective 
enumerations; thus PE t will receive two messages, one containing f and one containing w. 
Then PE t sends f to w, and now w has the address of a unique free PE. A program which 
implements cm:cons in terms of cm:enumerate appears in Figure 2.6. 

In order to implement cm: enumerate, we can use subcube induction, which is a special- 
ization of tree induction. In subcube induction, we statically organize all of the PE's into a 
balanced binary tree with PE's at the leaves (see Figure 2.7). We then enumerate inductively 
on the size of the tree. A tree containing one PE gets a one if that PE is selected, or a zero 
otherwise. A tree of depth k first enumerates its subtrees, and determines the number of PE's 
in each of the subtrees. If there are l PE's in the left subtree and r PE's in the right subtree, 
then each of the PE's in the right subtree is instructed to add l to its value, and we know 
that there are l + r PE's in the tree of depth k. 
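
A small sequential model, in plain Common Lisp, may make the induction concrete: selected is a vector of 0/1 flags, one per PE, and each selected PE ends up with a distinct integer starting at 0. The recursion mirrors the tree induction above: enumerate each half, then add the size of the left half to every value in the right half. The names are ours; the real implementation is microcode operating on subcubes, not a front end recursion.

(defun enumerate-model (selected)
  "Return (values NUMBERS COUNT); NUMBERS gives each selected PE its
enumeration value, and NIL for PE's that are not selected."
  (let ((numbers (make-array (length selected) :initial-element nil)))
    (labels ((enum (lo hi)                       ; enumerate PE's in [lo, hi)
               (cond ((= lo hi) 0)
                     ((= (- hi lo) 1)
                      (cond ((= 1 (aref selected lo))
                             (setf (aref numbers lo) 0)
                             1)
                            (t 0)))
                     (t (let* ((mid (floor (+ lo hi) 2))
                               (left-count (enum lo mid))
                               (right-count (enum mid hi)))
                          ;; every selected PE in the right half adds the
                          ;; number of selected PE's in the left half
                          (loop for i from mid below hi
                                when (aref numbers i)
                                  do (incf (aref numbers i) left-count))
                          (+ left-count right-count))))))
      (values numbers (enum 0 (length selected))))))

;; (enumerate-model #(0 1 1 0 1))  =>  #(NIL 0 1 NIL 2) and 3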

On a complete butterfly network (with processors at all the internal nodes) the running 
time is O(log N). It is possible to simulate a complete butterfly on the CM for the enumerate 
operation with an extra O(log log N) time[BRR] (where N is the number of processors in the 
CM). The cm: enumerate macro instruction has been "speed hacked" to run faster than a 






(defun cm:cons (new-address want free)
  ;; get some scratch memory
  (with-scratch-space ((temp-context 1)        ;; temp-context is one bit
                       ;; allocate some space for enumeration
                       (want-addr *address-size*)
                       (free-addr *address-size*)
                       (conser *address-size*)
                       (consee *address-size*)
                       (received-conser 1)
                       (received-consee 1)
                       (ignore-bit 1))
    ;; now save the context, so we will be able to restore it.
    (cm:store-context temp-context)
    (let ((number-wanting nil)
          (number-free nil))
      ;; unconditionally load the context with WANT, and enumerate them
      (cm:set-context)
      (cm:load-context want)
      (setq number-wanting (cm:enumerate want-addr))
      ;; and send ones own self-address to the rendezvous point
      (cm:send-data conser want-addr *self-address* *address-size*
                    received-conser)
      ;; now do the same for FREE
      (cm:set-context)
      (cm:load-context free)
      (setq number-free (cm:enumerate free-addr))
      (if (< number-free number-wanting) (error "not enough free PE's"))
      (cm:send-data consee free-addr *self-address* *address-size*
                    received-consee))
    ;; now the rendezvous point does its work.
    ;; anyone who received two messages (i.e. the received-conser
    ;; and the received-consee bits are true) should do it
    (cm:set-context)
    (cm:load-context received-conser)
    (cm:load-context received-consee)
    ;; actually do the send
    (cm:send-data new-address conser consee *address-size* ignore-bit)
    ;; now restore the original context
    (cm:set-context)
    (cm:load-context temp-context)))

Figure 2.6: A program to implement cm:cons in terms of cm:enumerate.







Each node computes (L, L+R), where L is the number of active children in the left subtree, and R is the number of active children in the right subtree.

Figure 2.7: Enumeration by subcube induction.

typical cm:send-data instruction. 

In general, the instructions which involve communication cost much more than the instruc- 
tions which manipulate the local state of a processor; however, there are a few exceptions, such 
as floating point operations, in which the local operation costs as much as a communications 
operation. 



2.2.6 Virtual Processors 

It is possible to 'time share' PE's to allow a larger set of 'virtual' PE's, each virtual PE having 
a smaller local memory and running correspondingly slower. There is microcode to support 
virtual PE's, and because we use less than 512 bits of the local memory to implement our 
simulator whereas there are 4096 bits of memory in each PE, we can run our simulators with 
half a million virtual PE's. 



Chapter 3 



Applicative Architectures 



We simulate static data flow, VIM style dynamic data flow, and combinator reduction archi- 
tectures. This section defines what we mean by each of these architectures and describes a 
high level implementation strategy for each. 

3.1 Static Data Flow 

A static data flow computer[Den74] consists of two active components: 

• The program graph, and 

• the structure storage. 

3.1.1 Static Data Flow Program Graphs 

Abstractly, a static data flow program graph is a directed graph with processing elements at 
the vertices. Data values travel along the arcs of the graph, and processing elements 'fire' 
when enough data is available on their input arcs, consuming data values from their input 
arcs and generating data values on their output arcs. We will simulate the program graph by 
statically allocating one connection machine PE to each data flow PE, and sending messages 




among the PE's. The macro instruction sequence that drives the simulator will serially select 
each 'kind' of data flow cell (e.g. adder cells and then multiplier cells) and subselect those 
that have enough data to fire. Then the remaining set of PE's will be instructed to do some 
local arithmetic which is dependent on the 'kind' of the cell selected (e.g., add the two input 
numbers to create an output number for data flow 'ADD' nodes), and finally the output results 
will be sent as messages. The next 'kind' of data flow cell is then selected to run and we do 
the whole process again. 
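
As a hedged sketch of what one such step looks like in terms of the macro instructions of Chapter 2, the following handles the ADD nodes; the field names (*node-is-add*, *have-both-inputs*, *input-1*, *input-2*, *result*, *dest-node*, *arriving-token*) and the assumed cm:move (field-to-field copy) and two-address cm:add instructions are ours, and the real simulator, which is written in STARTALK, keeps more state than is shown here. For simplicity the sketch also assumes that every result token is deposited into a single fixed field of the destination node and that no two tokens are delivered to the same node in the same step.

(defun simulate-add-nodes ()
  (with-scratch-space ((notify 1))
    ;; 1. select the ADD nodes, then subselect those with both inputs present
    (cm:set-context)
    (cm:load-context *node-is-add*)
    (cm:load-context *have-both-inputs*)
    ;; 2. local arithmetic in the selected PE's:  result <- input-1 + input-2
    (cm:move *result* *input-1* *value-length*)   ; assumed field copy
    (cm:add  *result* *input-2* *value-length*)   ; assumed two-address add
    ;; 3. each firing node sends its result token to its destination node
    (cm:send-data *arriving-token* *dest-node* *result* *value-length* notify)
    ;; 4. (not shown) clear the input-present flags so the node will not fire
    ;;    again until new tokens arrive; the driver then goes on to the next
    ;;    kind of node (multipliers, CAR nodes, and so on)
    ))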

3.1.2 Static Data Flow Structure Storage 

The structure storage of a static data flow system allows us to have data values which are too 
large to represent with one message packet; instead, these large data values are stored on a 
heap, with pointers used to represent the values. We choose cons cells as our basic structure 
storage, and introduce a few new kinds of program graph cells. 

Balanced trees of cons cells can be used to implement arrays with log n access latency 
(where n is the number of elements in the array). This is just a special case of the way 
that balanced m-way trees are used to represent arrays in the VAL simulator[AD79] or in 
VIM[DSG85]. 

Each cons cell will be dynamically allocated a PE on the CM, and the addresses of these 
cells will be passed around in the program graph. When the program calls for the CAR of 
a cons cell, the data flow graph obtained by compiling the program contains a CAR node. 
When the processor, M, which represents the CAR node of the data flow graph receives a 
message containing a pointer to a processor, N, representing a cons cell, M must extract the 
CAR of the cons cell represented by N. The details of how this extraction is performed and 
the impact on garbage collection will be discussed below. Our first design uses a high level 
software convention which copies a cons cell whenever too many pointers to the cons cell are 
needed. Our second design uses fan-in trees (which were introduced in Section 2.2.4). Our 
third and final design uses the cm:get and cm:send-with-add instead of cm:send-data, and 
avoids most of the complexity of the first two designs. 

Figure 3.1: When a third pointer is needed to a cons cell, we copy the cons cell.

Structure Storage with no Fan-In 

Our first design for the static structure storage uses a high level convention which has the 
property that at most two pointers to a given cons cell exist at any time, and if more pointers 
are needed, the cons cells are copied, copying the data (See Figure 3.1). Note that since the 
data can be cons cells, we may end up consing up a lot of processors to represent relatively 
few cons cells (See Figure 3.2). 

In this design, to implement CAR, a message will be sent to the processor representing 
the cons cell asking for the CAR, and the reply will contain the needed value. To implement 
garbage collection with this design, we introduce a kill message which says that a pointer is 
no longer needed. When both of the pointers to a cons cell have been killed, the cons cell can 
be deallocated. This is a very simple reference counting scheme, in which we can prove that 
the reference count for a cons cell becomes no larger than two. Thus, a pointer to a cons cell 
can be considered to be a capability to perform operations on the cell. 

Since we do not support a REPLACE-CAR operation (we are, after all, simulating architectures 



We may even do redundant copying.

Figure 3.2: When copying a cons cell, the data needs to have the copy operation performed also. In this case, the whole tree needs to be copied because all the cons cells are 'saturated'.

which support applicative languages), we know that our structure storage graph is acyclic, 
and that the only important notion of equality between two subgraphs is the standard LISP 
EQUAL operation: Two objects are EQUAL if they are both atoms (e.g. integers), or if they 
are both cons cells and their respective CAR's are equal and their respective CDR's are equal. 
This definition of equality means that it is acceptable to copy a cons cell to handle the fan-in 
problem. 

We want cons cells to have the following operations: 

cons car cdr 

Create and return a pointer to a cons cell with car and cdr as specified. 

kill object 

If object is a pointer to a cons cell, then it will be an error in the future to use object to 

perform operations. 






copy object 

If object is an atom, then return object, otherwise, create a pointer to an object which is equal 
to object. 

car object 

If object is an atom then this is an error. If the CAR of object is an atom then return the 
CAR, otherwise return a pointer to an object which is equal to the CAR of object. An implicit 
KILL of object is performed by this operation. Thus, if the program needs to keep a copy of 
the original cons cell, and manipulate the CAR of the cons cell, first a copy should be done, 
and then a car should be done. The rationale for this implicit KILL is that a data flow CAR 
operator takes a token containing the address of a cons cell on the heap, and creates a token 
containing the CAR of that cons cell, destroying the original token in the process. Thus, it 
makes sense for there to be an implicit KILL to be performed when the CAR is taken. If a 
program needs to keep a copy of a pointer to a cons cell, then the corresponding data flow 
program will make sure to have made a copy of the pointer to the cons cell before passing the 
pointer to the CAR data flow node. 

cdr object 

Similar to the CAR operation, except it operates on the CDR of the object. 

appendcar object newcar 

If object is not a pointer to a cons cell then this is an error. Otherwise, return a pointer to a 
cons cell with CAR equal to newcar and CDR equal to the CDR of object. This operation implicitly 
kills object and newcar in order to allow efficient implementation. In other words, the "caller" 
gives up the capability for object and newcar, and gets back a capability for a new cons cell. 
The returned cons cell is, in general, a new cons cell, but if the implicit kill operation would 
deallocate the old cons cell, we can reuse it. This reuse is called appending in place. 






There is a corresponding APPENDCDR operation which returns a pointer to a cons cell with 
the CDR modified. 

Our main motivation for introducing this capability based system is that the message 
delivery mechanism provided on the CM does not allow multiple messages to be simultaneously 
delivered to a single PE. By introducing the capability system we can guarantee that some 
statically bounded (say two) number of pointers to any given PE exist at any given time. 
Thus, if our static bound is N (in this case N = 2), we can add ⌈log N⌉ bits of information, 
called the in-box number, to our capability pointer and avoid message collisions. We will 
guarantee that for any in-box number i and any PE number p there will be at most one 
capability pointing to in-box number i on PE number p, and we can treat a capability as 
a permission to write into a particular in-box. During any message send, we will send only 
messages which have the same in-box number, and thus there will be no message collisions. 
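
A hedged sketch of how the in-box discipline turns into macro instructions: the front end loops over the N possible in-box numbers, subselects the PE's whose outgoing capability names that in-box, and issues one cm:send-data per in-box, aimed at a different destination field for each in-box. The field names, the constant, and the helper cm:select-inbox (subselect the PE's whose dest-inbox field equals i, e.g. with a field-comparison macro instruction) are our assumptions for illustration.

(defconstant +max-inboxes+ 2)   ; the static bound N from the text

(defun send-to-inboxes (message dest-pe dest-inbox size notify)
  (dotimes (i +max-inboxes+)
    ;; subselect the senders whose capability names in-box I
    (cm:set-context)
    (cm:load-context *sender-has-message*)
    (cm:select-inbox dest-inbox i)
    ;; all of these messages go to distinct (PE, in-box) pairs, so no two
    ;; of them can collide in this cm:send-data
    (cm:send-data (+ *inbox-base* (* i size)) dest-pe message size notify)))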

Thus, when a cons cell is created, we create a capability with its in-box number equal to 
zero, and every time we need to copy a capability, we might perform the following algorithm: 

• If some in-box number is unallocated, then we pick such an in-box number and return 
a capability with that in-box number. 

• If every in-box number is allocated, then we recursively COPY the CAR and the CDR of 
the object, and CONS up a new object with that CAR and CDR. 

When we KILL a capability which refers to a cons cell, we merely note that the in-box 
number of the capability is free. If all the in-box numbers are free, then we can KILL the CAR 
and CDR of the object, and return the PE in which the object resides to the set of free PE's. 

When we take the CAR of an object, we can note whether the reference count of the object 
is one. If it is, we can simply return the CAR of the object, KILL the CDR of the object, and 
return the PE in which the object resides to the set of free PE's. If the reference count of the 
object is strictly greater than one, then we decrement the reference count and return a COPY of 
the CAR of the object. Taking the CDR of an object is similar. 






When we APPENDCAR an object, if the reference count is one we can KILL the old CAR of 
the object, and store the new car in the object's CAR location, returning the original capability. 
If the reference count is greater than one, then we copy the CDR of the object, and CONS up a 
new cell using the new CAR and the copy of the CDR. Performing an APPENDCDR is similar to 
performing an APPENDCAR. 
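
A sequential Common Lisp model of these rules may help in following the space analysis below; a 'PE' is modelled as a node with a reference count, and +max-refs+ plays the role of the number of in-boxes. All of the names are ours: this is a front end model of the per-PE rules, not the CM implementation.

(defconstant +max-refs+ 2)

(defstruct node car cdr (refs 1))              ; car/cdr: an atom or another NODE

(defun copy-object (x)
  "Return a pointer that may be used independently of X."
  (cond ((not (node-p x)) x)                   ; atoms copy freely
        ((< (node-refs x) +max-refs+)          ; some in-box is unallocated
         (incf (node-refs x))
         x)
        (t                                     ; saturated: copy the cell itself
         (make-node :car (copy-object (node-car x))
                    :cdr (copy-object (node-cdr x))))))

(defun kill-object (x)
  "Give up one pointer to X."
  (when (and (node-p x) (zerop (decf (node-refs x))))
    ;; every in-box is now free: release the cell's own pointers
    (kill-object (node-car x))
    (kill-object (node-cdr x))))

Copying a pointer to a saturated tree duplicates the whole tree, which is exactly the behaviour bounded by the theorems below.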

Thus, by designing our system to use a capability scheme (and enforcing the scheme by 
software conventions), we gain the following advantages: 

• We avoid message collisions at the destination PE's. 

• We have garbage collection (using reference counts). 

• We can optimize certain operations (such as APPENDCAR when the reference count is one) . 
For applicative programming languages, this can be an important optimization[DSG85]. 

There are two often cited disadvantages of reference counted garbage collection which [Den80] 
rebuts for the case of applicative architectures in general. We believe our system also deals with 
these alleged disadvantages satisfactorily: 

• Reference counting does not work for cycles. However, because we are implementing 
an applicative system, our structure graphs can never have cycles, and this objection is 
unimportant. 

• Reference counting is expensive. However, given the nature of the CM message delivery 
scheme, we need some mechanism to prevent multiple messages from arriving at the 
same place at the same time. Our scheme is to keep the reference counts of PE's 
very low, and once we have implemented this scheme, most of the expense of reference 
counting (i.e. the incrementing and decrementing of the reference count, and dealing 
with overflow of the reference counter) is dealt with. On a serial machine, there are 
some real-time considerations: In general, we can not predict how many cons cells will 
be freed by a single KILL operation, and thus the time to perform a KILL can take 






arbitrarily long on a serial machine (unless one implements a mechanism whereby some 
of the work involved in the KILL operation is deferred). On a parallel machine, the PE 
which sent the KILL message does not need to wait for the KILL operation to complete 
before continuing, and so no additional waiting is incurred. 

There is one problem with the algorithm that we have specified, and that is that it is 
not very efficient in space. On a serial computer with no reference counts, it always takes 
O(n) memory locations to represent a structure storage containing n cons cells. With the 
algorithms we have specified, the number of PE's required to represent n cons cells is also 
dependent on the number of copies of the graph we have made, and the order in which we 
have made the copies. 

For the following theorems, we will assume without loss of generality that the maximum 
reference count for a PE is two. 

Definition 2 A PE is saturated if there are two pointers pointing at the PE. A graph of 
PE's is saturated if all the PE's in the graph are saturated. 

Theorem 1 Using the algorithms specified above, in the worst case, it takes Ω(nm) PE's to 
represent n cons cells with m pointers pointing 'in' to the cons cells from the 'outside' of the 
structure storage (e.g. from the program graph). 

Proof: First we construct a balanced binary tree containing n = 2^k − 1 cons cells (where k 
is the depth of the tree) . For the worst case analysis, assume that all of the PE's representing 
the cons cells in the tree are saturated, and that there are two pointers from the 'outside' 
(we can easily force this to be the case by constructing the graph in the right order). Let X 
be one of the two pointers to the head of the tree from the outside. If we perform a COPY 
operation on X, we will need to copy the entire tree, because when we need to copy the PE 
at the head of the tree, it will already be saturated, and we will need to cons up a new cell, 
copying the CAR and the CDR of the cell. The same thing will happen while copying the CAR 
and the CDR of the cell because they will be saturated. Thus the entire tree will be copied. 






(See Figure 3.2 for an example of how this behaves.) If we COPY X again, the entire tree will 
be copied again, and thus if we make m copies of X, we will need mn PE's to represent the 
graph. Thus in the worst case, we need at least mn PE's to represent m pointers to n cons 
cells. □ 

In fact, we can construct a worse lower bound. 

Theorem 2 Using the algorithms specified above, in the worst case, it takes Ω(m·2^n) PE's 
to represent n cons cells with O(m) pointers pointing to the cons cells from 'outside' of the 
graph. 

Proof: Suppose that the atoms at the leaves of the tree in Theorem 1 are all the same 
(See Figure 3.3). In that case there are only k distinct cons cells in the tree (by our definition 
of equality). The tree can be constructed with only k explicit CONS operations, again by 
construction, and it will take Θ(2^k) cells to represent the tree with only one copy of it (since 
the leaf of the tree (there is only one distinct leaf) will have 2^(k−1) pointers pointing at it, it 
will need at least 2^(k−2) cells to represent it). Then when we make m more copies of X we will 
have consumed Ω(m·2^n) PE's to represent m copies of a structure with only n distinct cons 
cells. □ 

And finally, we can show that the algorithms specified above are not always bad, i.e. there 
are cases where the algorithms run well. 

Theorem 3 Using the algorithms specified above, in the best case it takes Θ(n + m) PE's to 
represent a graph with n distinct cons cells and m pointers from the 'outside' into the graph. 
This best case can actually be realized. 

Suppose that we have a balanced binary tree with n = 2^k − 1 distinct cons cells represented 
by 2^k − 1 PE's, each with reference count equal to one. We can make m copies of the tree by 
the following algorithm: 

• Let X_0 be the head of the tree. 



• For each i in {1, ..., m}, let X_i be the copy of X_{i−1} (see Figure 3.4). 

Figure 3.3: If all the leaves of the tree are equal, it can take Ω(m·2^n) PE's to represent m copies of a structure with only n distinct cons cells.

Figure 3.4: In the best case it takes Θ(n + m) PE's to represent a graph with n distinct cons cells and m pointers into the graph.

Now we will need a few definitions and a lemma to show that the above algorithm works. 

Definition 3 If we have a single pointer, p, into a graph of cons cells, then a PE, r, which 
is part of the representation of that graph is on level t if there is a way to take t − 1 CAR's 
and CDR's from the pointer, and end up at r. (For example the top cons cell is on level one.) 
(Note that we mean the abstract operations of CAR and CDR rather than the implementation 
of CAR and CDR, which can modify the representation, in the structure storage, of the objects.) 

For complete balanced binary trees the level is clearly unique. 

Definition 4 Given a pointer, p, into a complete balanced binary tree of cons cells, p is called 
k-loaded if all PE's on level l < k have reference count equal to two, and all PE's on level 
l > k have reference count equal to one. 

Note that our definition of k-loaded is very restrictive. There are binary trees which are 
not k-loaded for any k. It turns out that the binary trees which we will build in the following 
proofs are k-loaded. 

Lemma 1 Starting with a complete balanced binary tree of distinct cons cells of depth k 
represented by 2^k − 1 PE's each with reference count equal to one, the above algorithm will 
consume 2^k additional PE's during the m = 2^(k+1) − 1 COPY operations, and X_m will be k- 
loaded, and the PE's on level k and higher will be shared with X_0, and if the complete balanced 
binary tree is not the whole graph reachable from X_0, any PE's on level k + 1 and higher will 
have the same reference count they started out with. 

Proof of lemma: Inductively on k. 

When k = 0, m = 1, and we can make one copy of the tree without consuming any 
additional cons cells (since the reference count of the head was one). After copying, X_1 will 
have the property that only the top cons cell (at level one) will have reference count equal 






to two, and the rest of the PE's will have reference count equal to one. All the PE's are still 
shared with X_0, and any PE's on level one or higher have not been touched, so their reference 
counts are the same. 

If the lemma is true for k = l, we need to show that it is true for k = l + 1. During the first 
a = 2^l − 1 copies we get that X_a is l-loaded by the inductive hypothesis. Thus creating X_(a+1) 
will force all the PE's in the first l levels to cons up new PE's, which will all have reference 
count equal to one. The PE's on level l + 1 will have their reference counts equal to two, since 
the cons cells are all distinct and thus every PE on level l will have increased the reference 
count of a distinct PE on level l + 1 (and the reference counts on level l + 1 were one until 
this point). If we then perform the next a copies to create X_(2a+1), we know by the inductive 
hypothesis that the PE's on level l + 1 still have their reference counts equal to two, and 
X_(2a+1) is l-loaded, so X_(2a+1) is actually (l + 1)-loaded. The PE's on level l + 1 or higher are 
still shared, and any PE's on level l + 2 or higher have their reference counts unmodified. □ 

Lemma 1 thus proves Theorem 3. □ 

It is important to realize that no matter what we do, if we have a complete balanced tree 
of depth n, with only n distinct cons cells in the graph, we will be forced to use Ω(2^n) PE's 
to represent the graph because of the fan-in limitations that our methodology requires. Thus, 
for this case we can not expect to do very well, and it is not fair to judge our algorithm on 
the Ω(m·2^n) bound derived in Theorem 2, but instead to judge our algorithm on the Ω(mn) 
bound derived in Theorem 1. The reason that this is important is that we will now describe 
an algorithm which allows us to achieve much better worst case bounds for the tree of depth 
n with 2^n distinct cons cells, but the improvement will still be swamped by the exponential 
term for the case of the tree of depth n with only n distinct cons cells. 

Fan- In Trees 

We can improve on the worst case bounds given above, making them essentially the same as 
the best case bounds of the above algorithm by introducing fan-in trees into our design. 






First we modify our specifications a little bit in order to make our implementation a little 
easier. We require that the COPY operation return two pointers and that it performs an 
implicit KILL of its argument. 

The algorithm: Structure storage is represented by two classes of PE's: the cells and the 
fan-ins. 

A cell behaves in approximately the same way that a cons cell PE behaved for the last 
algorithm, except on a COPY p operation: If the reference count is two, then the cell conses 
up a fan-in PE which will 'forward' references to p. The two returned pointers are pointers 
to the fan-in PE's in-boxes. 

A fan-in PE acts like a buffer, 'protecting' its forward pointer from requests. The fan-in 
PE has all of the reference counting machinery of a cons cell, plus a location to cache the CAR 
and CDR of its protected cons cell. As soon as the fan-in cell has cached both the CAR and the 
CDR of its protected cell, the fan-in PE changes its 'type' into a cell PE. 

In order to make this caching work, we introduce two new operations called CAR-KEEP and 
CDR-KEEP which are just like CAR and CDR respectively, except that they do not perform an 
implicit KILL on their arguments. 

When a fan-in PE, p, receives a 

KILL, The fan-in PE decrements the reference count, and notes which of its 'in-boxes' is 
freed by the KILL. If the reference count drops to 0, then the PE sends a KILL message 
to its 'protected' pointer and returns itself to the free pool.

COPY p, The fan-in PE checks to see if it has a free in-box. If so then it returns p and a 
pointer to the free in-box, noting that the in-box is no longer free. If there is no free 
in-box then it creates a new fan-in PE which 'forwards' to p, returning two pointers, 
both of which point to the new fan-in PE. 

CAR p, The fan-in PE checks to see if it has cached the CAR. If not, it issues a CAR-KEEP request
to its protected cell and caches the reply. At this point, the CAR has been cached. If
the implicit KILL of p (the fan-in cell) decreases the reference count to zero, then we
can simply return the CAR and kill the cached CDR if it is present. If the reference count
is still positive, then we send the CAR a COPY message, caching one of the results in its
own CAR and returning the other result.

CDR p, The fan-in PE behaves similarly to the way it did for CAR p. 

CAR-KEEP p, The fan-in PE behaves the same as it did for CAR p, except that we do not 
perform the implicit KILL, so we always have to perform a COPY of the cached CAR. 

CDR-KEEP p, The fan-in PE behaves similarly to the way it did for CAR-KEEP p. 

APPENDCAR p c, If the fan-in PE has reference count equal to one, then it sends a KILL
message to the cached CAR if it is present, and caches c as the CAR. If the fan-in cell has
reference count greater than one, then it conses up a new cons cell, storing c in the CAR
of the new cell, and its CDR (either obtained by performing a COPY of the cached CDR if
it is present, or else by sending a CDR-KEEP to its protected cell) in the CDR of the new
cell. The new cell is returned, and the reference count of p is decremented (which never
results in a KILL message being sent to its cached CDR, because in this case the reference
count was not one).

APPENDCDR p c, The fan-in PE behaves similarly to the way it did for APPENDCAR p c.
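To make the dispatch above concrete, here is a minimal sequential sketch of the KILL and
COPY cases in ordinary Lisp. The structure slots and helper names (fan-in, free-in-box,
make-fan-in, fan-in-kill, fan-in-copy) are illustrative assumptions, not the names used in our
simulator, and messages that would be sent to other PE's are simply returned as lists.

;; Sequential sketch of a fan-in PE handling KILL and COPY.
;; Slot and function names are illustrative, not the simulator's own.

(defstruct fan-in
  (ref-count 2)        ; one count for each in-box handed out
  (free-in-box nil)    ; t when one in-box has been KILLed
  cached-car
  cached-cdr
  protected)           ; pointer to the protected cons cell

(defun fan-in-kill (p)
  "Decrement the reference count; when it reaches zero, the KILL is
forwarded to the protected cell (returned here as a list)."
  (decf (fan-in-ref-count p))
  (if (zerop (fan-in-ref-count p))
      (list :kill (fan-in-protected p))
      (setf (fan-in-free-in-box p) t)))

(defun fan-in-copy (p)
  "Return two pointers: reuse a free in-box if there is one, otherwise
cons up a new fan-in PE that forwards to p."
  (cond ((fan-in-free-in-box p)
         (setf (fan-in-free-in-box p) nil)
         (incf (fan-in-ref-count p))
         (values p p))
        (t (let ((q (make-fan-in :protected p)))
             (values q q)))))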

Theorem 4 In the worst case, the above algorithms consume at most O(1) PE's and O(log n)
time for each operation (i.e. CAR, CDR, COPY, etc.), where n is the size of the tree being
operated on.

Proof: We will do a case analysis on the operation: 

COPY, At most one new fan-in PE is consed up, which takes constant time.

KILL, Consumes no PE's. An arbitrary number of PE's may be freed by any particular KILL
operation, but there is no need for the sender of a KILL message to wait until the KILL is
completed before continuing with useful computation. Even so, the arbitrary amount of
computation consumed by a KILL operation is inherent in any reference counted scheme,
and we depend on a parallel model of computation in order to justify not counting this
expense as part of the KILL. If we were to amortize the serial cost of the KILL over
all the KILL's, COPY's, and CONS's we could easily show that KILL's are cheap 'on the
average'. (The cost of performing all the KILL's is O(c) where c is the number of CONS's
and COPY's, and so the average cost of any particular operation is O(1).)

CAR-KEEP and CDR-KEEP, A COPY is performed, which consumes at most one PE and takes
constant time. Actually fetching the CAR (or CDR) takes at most O(log n) time.

CAR and CDR, May perform the corresponding KEEP operation (which consumes at most one 
PE and takes constant time), and then either a KILL (which takes constant time and 
consumes no PE's), or a COPY (which also takes constant time and consumes at most 
one PE). (Thus we may consume up to two PE's during a CAR or CDR.) 

APPENDCAR and APPENDCDR, Either operation performs 

• a KILL, or 

• a cons, followed by either a CAR-KEEP or a COPY. 

The most expensive path is to perform a cons (consuming one PE) followed by a CAR-KEEP
(consuming one more PE). □

The fan-in tree has the property that it smooths out the worst case behavior for copying 
pointers at the expense of making the CAR operation slower: The CAR operation was very fast 
when cons cells were copied since the PE pointed to always contained the data. The CAR 
operation could take much longer in the system described in this section because the request 
for the CAR must propagate through the fan-in tree. Note, however, that the CAR is cached so 
that the expected time for performing a CAR operation improves as more CAR operations are 
performed. 






Structure Storage using cm:get and cm:send-with-add

With the introduction of the cm:get and cm:send-with-add CM macro instructions, our
design for the structure storage becomes much simpler. The CAR operation is implemented by
doing a cm:get operation. In other words, we can freely propagate copies of pointers throughout
the system, without worrying about what happens when several processors try to fetch the CAR
of a cons cell at the same time.

Garbage collection becomes a little trickier, however. A reference counted garbage collection
scheme would work, just as it did for the fan-in tree implementation. However, since this
design does not require that reference counts be maintained just to keep track of how to copy
pointers, we might decide on another garbage collection technique, such as mark and sweep
(which was briefly explored in [Chr84]). We decided to use reference counting garbage collection,
partly because we understand it, and partly because it helps to maintain similarities
between the various designs for the structure storage.

Given these decisions, the structure storage has a very simple design. We use cm:get to
perform the CAR and CDR operations. We use cm:send-with-add to increment and decrement
the reference count. The APPENDCAR operation can be implemented by doing a cm:get on the
CDR, and incrementing the CDR's reference count, then doing a cm:cons to get a new PE, and
storing the new CAR and the 'copied' CDR in the new PE.
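As a concrete (and purely sequential) illustration, the sketch below spells out this APPENDCAR
recipe. The functions cons-cell, get-field, send-field, and send-with-add are toy stand-ins
defined here so that the sketch runs; on the CM they correspond to the cm:cons, cm:get, message
sending, and cm:send-with-add macro instructions, whose real calling conventions are not
reproduced here.

;; Toy stand-ins so the sketch runs sequentially; on the CM these would
;; be the cm:cons, cm:get, and cm:send-with-add macro instructions.
(defun cons-cell () (list :car nil :cdr nil :ref-count 1))
(defun get-field (pe field) (getf pe field))
(defun send-field (pe field value) (setf (getf pe field) value))
(defun send-with-add (pe field delta) (incf (getf pe field) delta))

(defun appendcar (p new-car)
  "Return a fresh cell whose CAR is NEW-CAR and whose CDR is P's CDR."
  ;; Fetch the CDR of p (with cm:get, any number of PE's may read it at once).
  (let ((old-cdr (get-field p :cdr)))
    ;; The copied CDR pointer is a new reference: bump its count.
    (when old-cdr
      (send-with-add old-cdr :ref-count +1))
    ;; Allocate a fresh cons cell and fill in both fields.
    (let ((q (cons-cell)))
      (send-field q :car new-car)
      (send-field q :cdr old-cdr)
      ;; APPENDCAR implicitly KILLs its argument.
      (send-with-add p :ref-count -1)
      q)))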

The fan-in trees are still implicitly being used, but because the fan-in trees are implemented
at a very low level, the implementation of our simulator becomes much simpler, and runs
much more quickly.

3.2 VIM Style Dynamic Data Flow 

As Will Rogers would have said, 'there is no such thing as a free variable.' 

Another proposed data flow architecture, which we will refer to as VIM style dynamic data
flow, is described in [DSG85]. VIM allows functions to be treated as first class data values:






They can be passed to functions and stored in structures. One result of this treatment
of function values is that recursive programs can be written [Kus84]. The VIM system was
originally designed to be run on a MIMD system such as a network of Lisp machine processors.
This section provides an abstract description of VIM, and describes the requirements for the
'primitive objects' that we will use to simulate VIM. We conclude this section with a
description of our simulation.

3.2.1 What is VIM 

The VIM system allows functions to be treated as data, and provides several other facilities
such as early completion structures and guardians [DSG85]. Given the mechanism that we will
develop in this section, it would be fairly straightforward to implement both early completion
structures and guardians on our data flow simulators, but we have not done so.

We will thus consider only the treatment of functions as data. In applicative languages
data values are immutable, and so in VIM style dynamic data flow functions as data are
immutable. On the other hand, data flow graphs are mutated while being interpreted (i.e.
the state of the program is in the data flow graph). This means that we need to distinguish
between functions as data and functions as data flow graphs. As proposed in [DSG85], VIM
uses function templates to represent the function value as a datum. To apply a function to
some arguments requires that the template be copied into otherwise unshared memory and
'transformed' into a data flow graph. The resulting data flow graph will have been dynamically
allocated from the memory heap, and will eventually need to be deallocated.

The VIM system is defined as an abstract machine rather than a programming language.
A compiler is needed to translate an applicative program into a data flow graph. Thus, to
support recursive programs in an applicative language it is sufficient to provide functions as data
values, since recursive programs can be implemented by passing functions as values [Kus84].

All functions in our world will be unary, i.e. they take exactly one argument. To simulate
functions of more than one argument, we can either curry functions to achieve multiple-argument
functions, or we can package up the arguments into a structure (from the structure
storage) and pass them all as one argument.

In general, we expect that several apply nodes will be trying to apply the same function 
to different arguments at the same time. In particular, there may be n apply nodes trying to
make a copy of a graph which consists of m PE's at the same time. We would like this graph 
copy to run quickly, and that is the main difference between implementing static data flow 
and VIM style dynamic data flow. 

Note that in VIM, any particular instance of a data flow node will fire at most one time
(because all loops are implemented by using recursion). In particular any instance of an apply
node will only fire once. Thus, when the apply node 'splices' the destination address for the
function's result into the function's graph, there is no problem with values arriving at the 
destination out of order. 

3.2.2 The Primitives Used to Simulate VIM 

Our simulation of VIM allows functions to be treated as data values by dynamically allocating 
storage for the function, and providing an apply operator to create a new copy of some 
subgraph of the program graph. The apply operator can be part of a data flow graph just
like any other operator, such as the add operator.

In order to support lexically closed function values, we introduce another data type, the
closure. By the time the compiler has translated an applicative program into a data flow
graph, the variables per se are gone. A closure is a value which can be combined with another
value (the value being "bound" to the "variable") to yield some other value (either a closure
value or a function value, in which the "variable" is "bound"). The compiler is then able to
translate the functional programs into data flow graphs. See Figure 3.5 for an example of this
translation.

To support closures, we introduced a new data type called closure. We define the CLOSE 
operation, and reference count maintenance operations on closure objects (and as usual these 






operations manifest themselves as data flow operators). The CLOSE operation may involve
performing some copying. The reference count manipulation is done as for cons cells.

Figure 3.5: Translating a lexically scoped function value into a closure.
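One way to picture the closure data type is as a function template paired with the values
bound so far: each CLOSE supplies one more value, and the final CLOSE yields an ordinary
function value. The sketch below is only an illustration of that picture; the slot names and
the flat list of bound values (closure, close-value) are assumptions, not the encoding our
simulator uses.

;; Illustrative model of a closure: a template plus the values bound so far.

(defstruct closure
  template         ; the function template still waiting for values
  (bound '())      ; values already supplied, outermost first
  remaining)       ; number of "variables" still unbound

(defun close-value (c v)
  "The CLOSE operation: bind one more value, yielding either a new
closure or, when nothing remains unbound, a function value."
  (let ((bound (append (closure-bound c) (list v)))
        (left  (1- (closure-remaining c))))
    (if (zerop left)
        (list :function (closure-template c) bound)  ; a complete function value
        (make-closure :template (closure-template c)
                      :bound bound
                      :remaining left))))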

We need a scheme to deallocate closures and function values when they are no longer 
needed, possibly either by maintaining a reference count on the data flow graph itself, or by 
causing the graph to deallocate itself automatically when it has completed its computation, 
as in [DSG85].

3.2.3 Implementing the VIM Primitives 

In order to simulate VIM style dynamic data flow we need to design a representation for
data flow graphs which can be quickly transformed from a template into a data flow graph 
which can run on the machine. Our implementation uses just one representation both for 
the function as a datum (i.e. the template) and the function as a data flow graph being 
interpreted on the machine. The transformation from template to data flow graph is thus 
trivial, the only requirement being that the graph be copied so that the new copy is not shared 
with any other part of the system (it will then be safe to mutate the graph). 

The design we settled on is the following: A function value is represented by a set of PE's. 
The function graph deallocates itself using a release mechanism, as described in[DSG85]. Each 
PE represents an operator in the data flow graph of the function. The PE's have two graphs 
superimposed on them. The first graph is the directed graph which corresponds to the data 
flow graph of the function. The second graph is a balanced binary tree which contains all of 
the PE's representing the function. 

The data flow graph is used just like the data flow graph was used for the static data flow 
implementation. Our function objects understand only one operation, the APPLY operation. 
It turns out that we do not need a KILL operation, since the function templates can be kept 
around in the machine effectively forever (we are simulating a system which runs for a short 
period of time, and so we do not need to worry about deallocating the template), and copies 
of the template are made only during the APPLY operation, in which case the function will 






deallocate itself with the release mechanism. 

The tree which is superimposed on the PE's is used to perform operations on the graph. 
The tree is a balanced binary tree, where the head of the tree knows how many vertices, x, 
are in the tree. The tree is used to perform operations such as the copy during the APPLY 
operation because a request can be propagated to the PE's in time O(log x).

As stated above, we expect that many instances of each function will be applied in parallel. 
Suppose that n apply cells receive their data and can fire at the same time, and that they all 
receive the same function, f, to apply. Suppose further that there are m PE's in the template
of f. Given fixed m, we require that any design be able to perform the application in time
faster than linear in n, because otherwise the parallelism that is present in the program will 
be lost. We also would like to be able to apply different functions at the same time, and that 
the speed of copying be a slowly growing function of m (the number of nodes in the graph), 
even though both the number of different functions and the maximal m are fixed once the 
program has started running. The following design can apply n copies of a graph with m 
nodes in time which is polylogarithmic in n and m. 

A useful subroutine is the operation of making one copy of a template. This can be done 
by first identifying the graph (propagate the fact that a copy is to be made through the tree 
which is superimposed on the template PE's), and then by using the algorithm to copy a 
graph described in Section 2.2.5. 

The algorithm is as follows (See Figure 3.6): 

• First the n apply cells must arrange themselves into a binary tree. This can be done 
using cm:build-tree (which is described in Section 2.2.4).

• The root of the tree makes one copy of the template for itself, and designates itself a 
copy-site. All the PE's which are not at the roots of their trees are not copy-sites to 
start with. 

• While there is a copy-site do the following: (Note that the cm:global-or operation can
be used to find out if any PE's are currently copy-sites.) 






For each copy-site, p, do in parallel: 

* If p has a left child, then make a copy of the graph, and send a pointer to the 
head of the copied graph to that left child. That left child becomes a copy-site. 

* If p has a right child, then make a copy of the graph, and send a pointer to 
the head of the copied graph to that right child. That right child becomes a 
copy-site. 

* Remove p from the process by making it no longer a copy-site. 




Figure 3.6: It is possible to make n copies of a graph with m vertices in time which is polylog
in n and m. (The figure shows levels 0, 1, and 2 of the tree of apply cells, distinguishing
program graph edges from embedded tree edges; each copy operation takes polylog(m) time,
and there are log(n) copy operations.)

Since the tree is roughly balanced, the while loop will run a polylog number of times in n,
and the operation of making a single copy of a graph with m vertices takes O(log m) time.
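Written out sequentially, the copy-site loop above looks like the sketch below. The tree of
apply cells is modelled with ordinary Lisp nodes, and copy-template stands in for the
single-template copy of Section 2.2.5; on the CM the loop test would be a cm:global-or and
each level of copies would run in parallel. All of the names (apply-node, distribute-copies,
and so on) are illustrative.

;; Sequential model of the copy-site loop.  On the CM, every copy-site
;; works in parallel and the WHILE test is a cm:global-or.

(defstruct apply-node left right copy)   ; the tree built by cm:build-tree

(defun copy-template (template)
  "Stand-in for copying one template (the graph copy of Section 2.2.5)."
  (copy-tree template))

(defun distribute-copies (root template)
  "Give every apply node in the tree its own unshared copy of TEMPLATE."
  (setf (apply-node-copy root) (copy-template template))
  (let ((copy-sites (list root)))
    (loop while copy-sites
          do (let ((next '()))
               (dolist (p copy-sites)              ; all copy-sites at once
                 (dolist (child (list (apply-node-left p)
                                      (apply-node-right p)))
                   (when child
                     (setf (apply-node-copy child)
                           (copy-template (apply-node-copy p)))
                     (push child next))))
               ;; every old copy-site retires; its children take over
               (setf copy-sites next))))
  root)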






The APPLY operation takes a function object and a value, copies the function object to 
create a copy which is not shared by any other part of the system, and modifies the graph 
to start it running: The argument is placed into the 'beginning' of the graph, and the destination of
the APPLY node is placed at the 'end' of the graph. The graph is then 'turned on' and starts 
running just as it did in the static data flow simulation. 

3.3 Combinator Reduction 

We simulate Traub's Abstract Combinator Reduction Architecture [Tra84] on the Connection
Machine by allocating one PE to each combinator application cell, and then sending the
appropriate 'reduce', 'reduce-ack' and other messages between the PE's. Traub's architecture
is particularly easy to simulate on the CM because it is easy to map combinator cells onto
PE's with very few design decisions.

We will describe the implementation of a combinator reduction system that performs the 
following reductions. 

(I x) => x

((K x) y) => x

(((S f) g) x) => ((f x) (g x))

(+ x y) => x + y

We have chosen the above reductions because they include one mathematical operator (which 
is strict in its arguments) and the basic combinators from [Tur79]. Once the implementa-
tion strategy for these combinators has been explained it should be easy to understand the 
implementation strategy for richer sets of combinators. 

Conceptually there is only one kind of cell in Traub's architecture, the application cell.
The application cell behaves as follows: 

There are two messages that the application cell recognizes: the reduce message and the
increment-reference-count message. The application cell has two variables in addition to
the machinery for reference counts: the function and the argument. The application cell also






caches the result of the reductions that it performs. 

When an application cell receives an increment-reference-count message, which contains
a signed integer i, the reference count is incremented by i. Note that if i is negative, the
reference count is decreased. If the reference count is decreased to one, then the application
cell can be deallocated, and the reference counts of its function, argument, and any cached
values can all be decremented.

When an application cell receives a reduce message it behaves as follows: 

• If the cell has already performed a reduce then it will have cached the result. The 
cached result is returned to the cell requesting the reduction. 

• Otherwise, we need to reduce the function. If the function is an atom (i.e., an I, K,
S, +, or a number) then it is already reduced, otherwise we send it a reduce message
and wait for the reply. If the reduced function is

I We reduce the argument, cache the result and return it.

K or S or + We return the pair consisting of the reduced function and the unreduced 
argument. We cache that pair for future use. 

(K x) We can throw the argument away, so we send an (increment-reference-count
-1) message to the argument, and cache x as our result, and return x (note that
we have to increment the reference count of x).

(S x) Return the triple consisting of S, x, and the unreduced argument. 

(+ x) Reduce x and the argument in parallel and then add the results together, caching 
and returning the sum. 

((S f) g) Cons up two new cells. Initialize the first new cell to contain f as its function
and one of the copies of our argument as its argument. The first new cell will be our
new function. The second new cell should be initialized so that g is its function
and the second copy of our argument is its argument. The second new cell will be






our new argument. Now we go back and act as if we had just received a reduce 
message. 

anything else is an error. 
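Ignoring the reference counts and the result cache, the dispatch above is the familiar SKI
evaluator, sketched sequentially below. Application cells are modelled as Lisp structures; the
names (app, reduce-cell) are illustrative, and the real simulator of course interleaves these
steps over many cells at once rather than recursing on the host.

;; Sequential sketch of the reduce dispatch (reference counts and the
;; result cache are elided).  APP models Traub's application cells.

(defstruct (app (:constructor app (fun arg))) fun arg)

(defun reduce-cell (x)
  "Reduce an expression built from I, K, S, + and numbers."
  (if (not (app-p x))
      x                                     ; atoms are already reduced
      (let ((f (reduce-cell (app-fun x)))   ; first reduce the function
            (a (app-arg x)))
        (cond
          ;; (I a)          => a
          ((eq f 'i) (reduce-cell a))
          ;; ((K b) a)      => b
          ((and (app-p f) (eq (app-fun f) 'k))
           (reduce-cell (app-arg f)))
          ;; (((S g) h) a)  => ((g a) (h a)), consing two new cells
          ((and (app-p f) (app-p (app-fun f)) (eq (app-fun (app-fun f)) 's))
           (reduce-cell (app (app (app-arg (app-fun f)) a)
                             (app (app-arg f) a))))
          ;; ((+ b) a)      => b + a, both reduced (+ is strict)
          ((and (app-p f) (eq (app-fun f) '+))
           (+ (reduce-cell (app-arg f)) (reduce-cell a)))
          ;; K, S, or + applied to too few arguments: a reduced pair
          (t (app f a))))))

;; Example: (reduce-cell (app (app '+ (app 'i 1)) 2))  => 3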

We need to be careful to correctly update the reference counts, and update them in the 
correct order. For example, it would not be correct to decrement the reference count of an 
object and then increment the reference count of another object, because if the object being 
decremented happened to be the same as the object being incremented, we could end up 
deallocating something we did not want to deallocate. If our decrements and increments 
appear in the correct order in our program, we are guaranteed that there will be no race 
conditions which cause improper deallocation of resources (since the CM is a SIMD machine). 
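Concretely, the rule is: when transferring a reference, increment before you decrement. The
tiny sketch below (with adjust-ref-count as a hypothetical stand-in for sending an
increment-reference-count message, and cells modelled as property lists) shows the safe order.

(defun adjust-ref-count (cell delta)
  "Stand-in for sending an increment-reference-count message."
  (incf (getf cell :ref-count) delta))

(defun transfer-reference (old new)
  "Move a reference from OLD to NEW.  The increment comes first: if the
decrement came first and OLD were the same cell as NEW, the count could
reach the deallocation threshold and the cell could be reclaimed."
  (adjust-ref-count new +1)
  (adjust-ref-count old -1))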

Traub [Tra84] proves that the above scheme will work without errors, assuming that the
original combinator program has no type errors. Also note that the only parallelism that this 
architecture allows is in the reduction of arguments to strict operators. It may be possible 
to achieve better performance by allowing the reduction of other expressions to proceed in 
parallel. 






Chapter 4 



StarTalk 



In this chapter we provide motivation for a special purpose language to simulate MIMD 
systems on the CM. We describe our language, which is called STARTALK, and discuss the 
optimizations that our compiler performs. 

Simulating MIMD machines on the CM is difficult without a special purpose language 
because the programmer is forced to write unstructured code with explicit labels. E.g., our 
original specification for cons cells was relatively small (about a page long), but after manually 
translating it into CM macro instructions with explicit manipulation of 'state' values (i.e. the 
'program counter' of the MIMD processor which we are simulating), the code had grown 
by more than a factor of ten. Furthermore, the original structure of the code had been 
lost, and because this process was expensive, when changes needed to be made in the code,
the temptation was great to modify the 'object code' instead of modifying the source code
and manually recompiling. There is also a large class of optimizations which are infeasible
to perform by hand. By designing a special purpose language we were able to maintain a
compact and readable specification of our simulator, and perform optimizations. 

The STARTALK language is 'object oriented' in that the program specifies what any par- 
ticular object does. There are primitives to deal with message passing and local arithmetic, as 






well as flow of control structures such as LOOP and IF. There is a static subroutine mechanism 
which allows non-recursive subroutines to be written. The STARTALK language is very much 
like CL1 [Baw84] in that one thinks about the objects and how they behave. In CL1, it was
absolutely necessary to avoid propagating copies of pointers to objects because the machine
programming model used by CL1 includes only a very simple message passing primitive (i.e.
it includes cm:send-data but not cm:send-with-logior or cm:get). In STARTALK, one
need not necessarily worry about the propagation of copies of pointers to objects. 

Here we will describe the implementation of STARTALK, including the flow of control, the 
subroutine linkage mechanism, and the 'interpreter' cycle. 

The STARTALK compiler translates STARTALK programs into CM macro code by the 
following conventions. The compiler identifies basic blocks of code (i.e. code which is executed
linearly) and gives each basic block a unique nonnegative integer identifier in a compact interval
starting with zero. Every processor has a field which can be thought of as the 'program
counter' for that processor. When that field contains the number t in a given processor, that
processor 'wants' to run basic block number t. It is the responsibility of the code in each basic
block to set the 'program counter' for that processor to the block number of the next basic
block which is to be executed. This value can be set to one of several values to simulate 
conditional statements, or it can be used to simulate loop statements. In fact, since the 
block number is simply data, the block number can be stored or computed in an arbitrary 
manner and used to simulate 'indirect' jumps. The subroutine linkage mechanism uses this to
advantage. Since STARTALK subroutines are static (i.e. there is no recursion or mutual
recursion) it is possible to statically assign a location to be used for subroutine linkage. This 
location can be used to pass arguments, return results and store the program counter for the 
next basic block to be executed after the subroutine returns. 

To interpret a compiled STARTALK program, the host computer can step through all of 
the basic block numbers, selecting the processors which 'want' to run a given basic block, and 
then broadcasting the instructions for that basic block. The broadcasted instructions will 






have set the program counters for those processors to new values, and when the interpreter
cycles to the other block numbers, the processors will make more progress.
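In outline, the interpreter cycle is a single host-side loop over the basic block numbers. The
sketch below models it sequentially: each simulated processor is a property list with a :pc
field, and the vector BLOCKS maps each block number to a function of one processor,
standing in for the CM context selection and the broadcast of that block's macro instructions.
All of the names (run-interpreter-cycle, run-startalk) are illustrative.

;; Sequential model of the STARTALK interpreter cycle.

(defun run-interpreter-cycle (processors blocks)
  "Give every basic block a turn.  Each selected processor runs the
block's code, which is responsible for setting that processor's :pc to
the next block it wants to execute."
  (dotimes (block-number (length blocks))
    (let ((selected (remove-if-not               ; the context selection
                     (lambda (pe) (= (getf pe :pc) block-number))
                     processors)))
      (dolist (pe selected)                      ; 'broadcast' to the selected set
        (funcall (aref blocks block-number) pe)))))

(defun run-startalk (processors blocks cycles)
  "Repeat the cycle; each pass lets every processor make some progress."
  (loop repeat cycles
        do (run-interpreter-cycle processors blocks)))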

It is possible to perform optimizations to improve the quality of the code. None of the 
optimizations described below are present in the current version of the STARTALK compiler.

Variable allocation can be optimized by doing global data flow analysis on the code. Local 
processor memory is a scarce resource on the connection machine. By performing global data 
flow analysis, it is possible to reduce the amount of memory actually needed to store the state 
of a STARTALK object. By reducing this memory requirement, we can run with more virtual 
processors and support a larger number of active objects at one time. This optimization 
was actually implemented, but is not used in the present version of the STARTALK compiler 
because we were more interested in quick prototyping than running large systems. 

Common subexpressions can be factored out, especially the most expensive common subex- 
pressions, into subroutines. For example, if one set of PE's, A, needs to perform a float- 
ing point multiply, and another set of PE's, B, needs to perform a floating point add, the 
STARTALK compiler can arrange for the normalization phase for both A and B to be carried
out in parallel, by doing the actual multiplication while A is selected, and the actual addition
while B is selected, and then normalizing with both A and B selected. Currently, it is
necessary for the programmer to explicitly place code into subroutines if such sharing is to
be achieved. The cases where we actually use subroutines to improve the efficiency are code
segments containing message sending code and code such as cm:cons.

The object code generated by STARTALK looks bad. We suspect that a large improvement
could be made by performing local optimizations, such as peephole optimization, and by handling
special cases (such as when values are constants) more consistently.






Chapter 5 



Conclusion 



In this chapter we will describe and analyze the preliminary measurements of the perfor- 
mance of our simulators, and conclude with a discussion of possibilities for further work and 
exploration of remaining open questions. 

We have implemented two of the simulators described in this paper: the static data
flow simulator and the combinator reduction simulator. The VIM style dynamic data flow
simulator has not been implemented. Our experiences confirm the widely held belief that
programming parallel machines is not easy: both the implementation of the STARTALK compiler
and the simulators written in STARTALK were nontrivial, and our initial attempt to implement
the simulators directly in the connection machine primitives was so difficult that we could
not get it to work at all. One of the reasons that we have not implemented the VIM style
dynamic data flow simulator is that these simulators are relatively difficult to write.

5.1 Performance of the Simulators 

We would like to say that our simulators give high performance, but it is difficult to define 
what we mean by 'high performance' on simulators, especially when comparing hardware 






designed especially for an architecture to the simulation of the architecture. This is especially 
true because it is always possible that our simulators are not as good as they could be (e.g.,
it is possible that we have not done as good a job as we might have), which means that we can
give only a lower bound on the performance of the connection machine as a simulator.

We could measure several aspects of our simulator's efficiency, including the percentage of 
PE's which are being used at any time, the number of primitive operations per second (i.e., 
for static data flow we count the number of node firings per second, for dynamic data flow we 
also count the number of function applications per second, and for combinators we count the 
number of reductions per second), or the speed to run certain programs. 

We chose to measure the number of primitive operations per second for several essentially 
serial programs. The reason that we chose to measure serial programs is to explore the worst 
case behavior of our simulators. Clearly if a program has much parallelism, we will achieve 
more primitive operations per second than if the program has little parallelism. The goal of 
this paper is not to explore the improvement to be gained from exploiting the parallelism in 
programs, but is rather to try to understand how to simulate parallel applicative architectures 
on the connection machine. We therefore decided to measure essentially serial programs in 
order to be able to compare our simulators to other implementations of applicative architec- 
tures. 

There are several reasons that the decision to measure serial programs might be a bad one. 
The most important such reason is that if we were to compare the simulation of essentially 
serial programs on the connection machine to the simulation of the same program on a fast 
serial machine, then our results would be misleading. The serial machine will clearly have an 
advantage on serial programs, while if we measured parallel programs, we would be able to 
compare the two implementations in a broader context of their 'real' performance. 

We have chosen to measure serial programs in spite of those objections because it is 
relatively easy to measure and understand such programs, and because of time constraints. 
Figure 5.1 shows the results of our measurements for the static data flow and combinator 



reduction simulators.

Static Data       Total Number of    Time in    Node Firings
Flow Program      Node Firings       Seconds    Per Second
Iter1                   16             ≈2           ≈8
Iter5                   41             ≈5           ≈8
Iter50                 356            ≈34          ≈10
Cons1                    9             <1           >9
Cons2                   38             ≈5           ≈7

Combinator        Total Number of    Time in    Node Firings
Program           Node Firings       Seconds    Per Second
(Dfib 1)                62           11.3          5.5
(Dfib 2)               130            ≈22          ≈5.9
(Sfib 1)               142            ≈26          ≈5.4

See the text for a description of the programs.

Figure 5.1: Performance Measurements on a few programs.

We measured two classes of programs for the static data flow simulator. The first class is
an iteration: the program named iter5 was five iterations of a simple addition, while the
program named iter1 was one iteration of the same program, and so on. The second class
was a pair of programs which did some consing, and they appear in Figure 5.2.

We measured both the doubly recursive and the singly recursive implementations of the
Fibonacci function on the combinator simulator. The doubly recursive Fibonacci program is
named DFib, and the singly recursive Fibonacci program is named SFib. The doubly recursive 
program has a lot of parallelism in it. 
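For reference, the two recursion shapes are the familiar ones below, written here as ordinary
Lisp. The measured programs were of course expressed as combinator graphs, so these
definitions (including the base case chosen) are only illustrative.

;; The two recursion shapes measured on the combinator simulator.

(defun dfib (n)
  "Doubly recursive Fibonacci: the two recursive calls can proceed in parallel."
  (if (< n 2)
      n
      (+ (dfib (- n 1)) (dfib (- n 2)))))

(defun sfib (n)
  "Singly recursive Fibonacci: an essentially serial accumulating loop."
  (labels ((iter (i a b)
             (if (= i n) a (iter (+ i 1) b (+ a b)))))
    (iter 0 0 1)))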

Our results indicate that our simulators are pretty slow. The number of firings in a static 
data flow graph is not directly comparable to the number of reductions in a combinator 
simulator, since every firing in the static data flow graph seems to do more work than a






combinator reduction: Many of the reductions in a combinator program are just the routing
of arguments to the body of a function, while most of the firings in a static data flow graph
are more related to the actual work which we want done. These results also indicate that
relatively few of the PE's which are in a graph are actually active at any time.

(defun cons1 ()
  (let ((x (cons 1 2)))
    (+ (car x) (cdr x))))

(defun cons2 ()
  (let ((x (cons (cons 1 2) (cons 3 4))))
    (+ (+ (car (car x))
          (cdr (car x)))
       (+ (car (cdr x))
          (cdr (cdr x))))))

Figure 5.2: Static data flow programs to measure consing.

We found that it was very easy to run out of processors for programs with much parallelism. 
For example, on a 512 processor connection machine, we ran out of processors while computing 
(DFIB 5). 

The preliminary measurements show that the static data flow interpreter is much faster 
than the combinator reduction interpreter. On the other hand, the static data flow paradigm 
is not as powerful as the combinator paradigm because, for example, it is not possible to write
higher order functions or recursive functions in a static data flow language. We believe that 
the static data flow interpreter is inherently faster than combinator interpreters, because to 
do very simple operations with combinators can require a lot of work. 

Supercombinators [Hug82] are one proposed solution which reduces the amount of work 
done for the simple operations, but the connection machine is not very well suited to simulating 
supercombinators: In general, the more rules there are to interpret, the longer the interpreter 
cycle takes to give all the nodes a chance to fire, and supercombinator compilers tend to 
produce a lot of rules. 






5.2 Qualitative Results 

We are able to run applicative programs, which allows us to gain experience with these 
programs. We are able to quickly test new ideas about the design of applicative architectures 
and other MIMD machines. These simulators also demonstrate that the connection machine
is a general purpose, efficient, scalable, parallel computer (as defined in Chapter 1), because 
it can run applicative programs efficiently. 

We have found that the static data flow simulator runs a given program more quickly on 
our simulators than the combinator reduction simulator does. Thus, if a given program can 
be written in a static data flow language, such as VAL, it should be run on a static data flow 
machine. 

One important lesson that we have learned is that the choice of message sending primitives
can make a big difference in how easy it is to write programs. The use of cm:get and cm:send-
with-add resulted in much simpler programs which were much easier to write, debug, and
understand. In essence, these operations helped the connection machine look like a SIMD 
machine with a globally accessible memory (i.e. a PRAM) rather than a message passing 
SIMD machine. 

We noticed that not very many PE's were really active at any given time. Presumably 
it would help the performance of these simulators to have more PE's active. Since we have 
mapped each primitive object in our programming model into a single PE (e.g. we mapped 
each node of a data flow graph, each cons cell, and each function application cell into one 
PE), there is not much we can do to increase the number of active PE's. It is possible that 
with an efficient implementation of virtual processors (see Section 2.2.6) we could use a higher 
percentage of the real PE's. The main architectural obstacle preventing efficient support of 
virtual processors is that each physical PE in the connection machine must address the same 
memory location at the same time. This means that if one virtual PE has its memory stored at
one location in physical PE number 5, and another virtual PE has its memory stored at location
128 in physical PE number 10, and both want to run at the same time, then the microcontroller






will have to issue the instructions twice: once for each virtual processor bank. It would help 
if each PE could access a different memory location, because then different banks could run 
on the different processors at the same time. 

Clearly, it would help if the connection machine were a MIMD machine, since each PE 
could simulate a different kind of node more efficiently, and it would be feasible to implement 
supercombinators on the connection machine. Of course, it is not clear that if the connection 
machine were somehow changed into a MIMD machine that it would be anything like what 
we currently think of as a connection machine. 

5.3 Future Work 

Our simulators for static data flow and combinator reduction are running on the connection 
machine, and have largely met our goals. Our implementation for the VIM style dynamic
data flow simulation has not been completed. 

We are still interested in comparing the performance of VIM style dynamic data flow to
combinators. It makes no sense to compare VIM style dynamic data flow with static data
flow because static data flow is not as powerful as dynamic data flow, and anything that can
be written in a static data flow language should run at least as quickly on a static data flow
interpreter as on a VIM style dynamic data flow interpreter. For about the same reasons it
makes little sense to compare combinator reduction with static data flow. However, it does
make sense to compare combinator reduction to VIM style dynamic data flow.

The remaining questions that we have not answered to our satisfaction are:

• What is the best way to limit the consumption of resources while allowing the parallelism
of programs to be exploited?

• What is the tradeoff between a few simple combinators (which may require many reduc- 
tions in order to do a given amount of work), and many powerful combinators (which 
require fewer reductions, but the interpreter cycle becomes longer)? Where in this 






spectrum is the optimum? 

• What is the best way to get the most computation per dollar spent on hardware? In
SIMD architectures, this translates to a problem of using a high percentage of the PE's, 
and while our STARTALK compiler addresses this issue by performing optimizations, 
we have yet to see conclusive results on the design tradeoffs between various SIMD 
architectures such as the connection machine and MIMD architectures such as the tagged 
token data flow system [ACI*83].

• How bad is the code generated by the STARTALK compiler? Qualitatively, the code 
looks bad. Presumably, cleaning up the code would help improve the performance 
of our simulators. It is also possible that special purpose microcode would help the 
performance of our simulators. Further exploration of these efficiency issues is needed. 

• What changes to the connection machine architecture would help applicative languages 
run quickly on the connection machine? Would a connection machine with such changes 
still be a connection machine? 






Bibliography 



[ACI*83] Arvind, D. E. Culler, R. A. Iannucci, V. Kathail, K. Pingali, and R. E. Thomas. 
The tagged token data flow architecture. 1983. Preliminary version for distribution 
in subject 6.843 at the Massachusetts Institute of Technology. 

[AD79] William B. Ackerman and Jack B. Dennis. VAL - A Value Oriented Algorith- 
mic Language: Preliminary Reference Manual. Technical Report MIT/LCS/TR- 
218, Massachusetts Institute of Technology, Laboratory for Computer Science, June 
1979. 

[ADI83] Arvind, Michael L. Dertouzos, and Robert A. Iannucci. A Multiprocessor Emu- 
lation Facility. Technical Report MIT/LCS/TR-302, Massachusetts Institute of 
Technology, Laboratory for Computer Science, October 1983. 

[AI85] Arvind and R. A. Iannucci. Two Fundamental Issues in Multiprocessing: The 
Dataflow Solution. Laboratory for Computer Science, Computation Structures 
Group Memo 226-3, Massachusetts Institute of Technology, August 1985. 

[Arv] Arvind. Personal Communication. 

[Baw84] Alan Bawden. A Programming Language for Massively Parallel Computers. Mas- 
ter's thesis, Massachusetts Institute of Technology, Department of Electrical Engi- 
neering and Computer Science, 1984. 






[BRR] Guy Blelloch, Abhiram Ranade, and John Rose. Personal communication. Guy 
Blelloch is currently a graduate student at M.I.T., Abhiram Ranade is currently a 
graduate student at Yale, and John Rose is currently employed at Thinking Ma- 
chines Corporation, Cambridge, MA. 

[Chr84] D. P. Christman. Programming The Connection Machine. Master's thesis, Mas- 
sachusetts Institute of Technology, Department of Electrical Engineering and Com- 
puter Science, 1984. 

[Den74] J. B. Dennis. First version of a data flow procedure language. In Lecture Notes in
Computer Science, pages 362-376, Springer-Verlag, 1974.

[Den80] J. B. Dennis. Data flow supercomputers. IEEE Computer, 48-56, November 1980. 

[DGK86] Susan Dickey, Allan Gottlieb, and Richard Kenner. Using VLSI to reduce serialization
and memory traffic in shared memory parallel computers. In Charles E. Leiserson, 
editor, Proceedings of the Fourth MIT Conference on Advanced Research in VLSI, 
pages 299-316, April 7-9 1986. 

[DSG85] Jack Dennis, Joseph Stoy, and Bhaskar Guharoy. Vim: an experimental multiuser 
system supporting functional programming. In Proceedings of the International 
Workshop on High-Level Computer Architecture, Los Angeles, California, May
23-25 1985. 

[HU85] W. D. Hillis. The Connection Machine. The MIT Press, Cambridge, MA, 1985. 

[Hug82] R. J. M. Hughes. Super-combinators. In Proceedings of the 1982 Lisp and Functional
Languages Conference, 1982. 

[Kus84] Bradley C. Kuszmaul. Type Checking in VimVal. Technical Report MIT/LCS/TR- 
332, Massachusetts Institute of Technology, Laboratory for Computer Science, May 
1984. 






[Kus85] Bradley C. Kuszmaul. Fast deterministic routing on hypercubes with small buffers.
Term paper for Ron Rivest's graduate class on Advanced Algorithms. The paper is 
available from the author at the MIT Laboratory for Computer Science, Cambridge, 
MA 02142, December 1985. 

[Tra84] K. R. Traub. An abstract architecture for parallel graph reduction. B.S. Thesis, 
Massachusetts Institute of Technology, Department of Electrical Engineering and 
Computer Science, 1984. 

[Tur79] D. A. Turner. A new implementation technique for applicative languages. Software- 
Practice and Experience, 9:32-49, 1979. 






Biographic note: Bradley "Bradley Bear" C. Kuszmaul was born in Iowa City, Iowa,
where at the age of one year, he escaped from the watchful eyes of his parents, went outside, 
and got lost in the snow. His parents found him and took him to Kansas City, where he 
lived until migrating to Boston to attend M.I.T. Bradley Bear double majored in math and
computer science at M.I.T., graduating in 1984. He was the runner-up for best undergraduate
thesis in the department of computer science (Type Checking in VimVal), which is probably
the only reason that M.I.T. let him enter graduate school. He has subsequently written two
papers: The Suitability of the Connection Machine for Simulating Applicative Architectures, 
and Fast Deterministic Routing on Hypercubes with Small Buffers. Bradley Bear received his 
nickname from mathematician Kimberly Kay Lewis, to whom he is now married. 


